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Abstract 

We describe the implementation and performance of the P 3 T (Particle-Particle 
Particle-Tree) scheme for simulating dense stellar systems. In P ? T, the force 
experienced by a particle is split into short-range and long-range contributions. 
Short-range forces are evaluated by direct summation and integrated with the 
fourth order Hermite predictor-corrector method with the block timesteps. For 
long-range forces, we use a combination of the Barnes-Hut tree code and the 
leapfrog integrator. The tree part of our simulation environment is accelerated 
using graphical processing units (GPU), whereas the direct summation is carried 
out on the host CPU. Our code gives excellent performance and accuracy for star 
cluster simulations with a large number of particles even when the core size of 
the star cluster is small. 

PACS numbers: 95.10.Ce, 98.10.+Z 

Keywords: methods: N-body simulations 


1 Background 

Direct IV-body simulation has been the most useful tool for the study of the evolu¬ 
tion of collisional stellar systems such as star clusters and the center of the galaxy [1] . 
The force calculations, of which the cost is 0(N 2 ), are the most compute-intensive 
part of direct JV-body simulations. Barnes and Hut [2] developed a scheme which 
reduces the calculation cost to O(NlogN) by constructing the tree structure and 
evaluating the multipole expansions. Dehnen [3, 4] developed a scheme to reduce the 
calculation cost to O(N) by combining the fast multipole method [5] and the tree 
code. Recently, the graphical processing units (GPU), which is a device originally 
developed for rendering the graphical image, began to be used for scientific simula¬ 
tions. The tree code is also implemented on GPUs and it is much faster than that on 
CPUs [6, 7]. Bedorf et al. [8] parallelized the tree code on GPUs and showed good 
scalability up to 18600 GPUs. They also simulated the Milky Way Galaxy with N 
of up to 242 billion and reported that the average calculation time per iteration on 
18600 GPUs was 4.8 seconds. 

The tree schemes are widely used for collisionless system simulations. However, for 
collisional system simulations, the use of the tree code has been very limited. One 
reason might be that a collisional stellar system spans a wide range in timescales. 
Thus it is essential that each particle has its own integration timestep. This scheme 
is called the individual timestep or the block timestep [9]. However, when we use 
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the tree code and the block timestep together, the tree structure is reconstructed 
at every block timestep, because the positions of integrated particle are updated. 
The cost of the usual complete reconstruction of the tree is O(NlogN) and not 
negligible. 

To reduce the cost of the reconstruction of the tree, McMillan and Aarseth [10] 
introduced local reconstruction of tree. They demonstrated a good performance, 
but there seems to be no obvious way to parallelize their scheme. 

Recently, Oshino et al. [11] introduced another approach to combine the tree code 
and the block timesteps which they called the P 3 T scheme. This scheme is based 
on the idea of Hamiltonian splitting [12-18]. In the P 3 T scheme, the Hamiltonian 
of the system is split into short-range and long-range parts and they are integrated 
with different integrators. The long-range part is evaluated with the tree code and is 
integrated using the leapfrog scheme with a shared timestep. The short range part 
is evaluated with direct summation and integrated using the fourth-order Hermite 
scheme [19] with the block timesteps. They investigated the accuracy and the per¬ 
formance of the P 3 T scheme for planetary formation simulations and showed that 
the P 3 T scheme achieves high performance. 

In this paper, we present the implementation of the P 3 T scheme on GPUs and 
report its accuracy and performance for star cluster simulations. We found that 
the P 3 T scheme demonstrates a very good performance for star cluster simulations, 
even when the core of the cluster becomes small. 

The structure of this paper is as follows. In section 2, we briefly describe the P 3 T 
scheme. In section 3, we report the accuracy and performance of the P 3 T scheme. 
We summarize these results in section 4. 


2 Methods 

2.1 Formulation 

In this section, we describe the P 3 T scheme. The Hamiltonian H of a gravitational 
TV-body system is given by 
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where Pi 7 rrii and q, are momentum, mass and position of the particle i, respectively. 
To avoid the singularity of the 1/r potential, we use the Plummer softening e [1], 
With the P 3 T scheme, H is split into ffhard and H so f t as follows [11]: 
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Here W(sij) is a smooth transition function. A suitable form of W(sij) should be 
zero when a distance between two particles is smaller than the inner cutoff radius r m 
and should be unity if the distance is larger than the outer cutoff radius r cut . This 
splitting is introduced by Chambers [15] to avoid undesirable energy error from close 
encounters between particles. Similar splitting has been used with P 3 M (Particle- 
Particle Particle-Mesli) scheme, in which the long-range part of the interaction is 
evaluated by using FFT [20]. 

Forces derived from F/hard and iF so ft are given by 


■^hard,i 

diehard rmrrij 
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We call K(sij 

) the cutoff function. 



The tree algorithm is used for the evaluation of F so f t i to reduce the calculation 
cost. 


The formal solution of the equation of motion for the phase space coordinate 
w = ( q,p ) at time t + St for the given Hamiltonian H is 

w(t + St) = e 5t{ ’ H} w(t) = e^fAoft+ffhard} w (t). (10) 


Here the braces {,} stand for the Poisson bracket. In the P 3 T scheme, we use the 
second order approximation; 


w(t + St) = e st{,Hh*rd} e st/2{,Hsoit} +0(6t 3 ). 


( 11 ) 


Here, the formal solution for the H so f t term is the simple velocity kick, since H so f t 
contains the potential only. We numerically integrate the F/hard term, since it can¬ 
not be solved analytically. We use the fourth-order Hernrite scheme with the block 
timestep [19]. The fourth-order integrator requires K(stj) to be three-times differ¬ 
entiable with respect to position. We use the following formula: 
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This K(x) is the lowest-order polynomial which satisfies the requirement that 
derivatives up to the third order is zero for x = 0 and 1 (i.e. The highest-order term 
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of the lowest-order polynomial is the seventh, because there are eight boundary 
conditions at x = 0 and x = 1 ). 

In figure 1, we plot K(y) (top panel) and forces (bottom panel) with 7 = 0.1. 
According to [11, 15], K(y) with 7 = 0.1, is smooth enough to be integrated. Thus, 
for all calculations, we use 7 = 0.1. The functional form of W(y,j) is given by 


{ 7( 7 s —97 s +457 4 —607 3 log7—457 2 + 97 —1) ^ ^ ^ 

G{yn) + (1 - G(l\~f))y (7 < y < 1) > ( 16 ) 

1 (1 < y) 

G{y,^) = (-10/3y 7 + 14(7 + l)y 6 - 21(7 2 + 37 + l)?/ 5 
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+ 210 ( 7 3 + 7 2 )y 2 - 1407 3 j/log(y) 

+(7 7 - 7 7 6 + 21 7 5 - 35 7 4 )) /(I - 7) 7 - (17) 

With the P 3 T scheme, the time integration proceeds as follows 
1 At time f, by using the tree code, calculate the acceleration due to H so [ t , a so f t) j, 
and construct a list of all particles which come within r cu t from particle i for 
Afsoft. Here, Ai so f t is the timestep for the soft Hamiltonian. 

2 Update the velocities of all particles with v new ^ = u 0 id,i + (l/2)Af so ft a sofM- 
3 Integrate all particles to time t + Af so f t under 7?h a rd , using the neighbour list 
and the fourth order Hermite integrator with the block timesteps. 

4 Calculate the acceleration due to H so f t at new time t + Af so f t and update the 
velocity 

5 Go back to step 2. 

For the timestep criterion for the block timestep, we use the following form [11]. 



110 , , 

a 0 = a^—. (19) 

Uut 


Here 77 is the accuracy parameter of the timestep and its typical value is 0.1. 
Af max is the maximum timestep which should be smaller than Af so f t , a^ is the 
nth time derivative of the acceleration of particle *, ao is a constant introduced to 
prevent At* from becoming too small when the distance to the nearest neighbor is 
close to r cut and a is a parameter to control ao- In this case, the acceleration from 
-Hhard becomes very small and there is no need to use very small At*. According to 
[11], when we chose a < 1, a hardly affects the energy error. Thus we set a = 0.1 
for all simulations. 

(2) (3) 

In our Hermite implementation, a> ' and a\ ' are derived using interpolation of 
a-°^ and a- 1 ^, and as a consequence we cannot use equation (18) for the first step. 
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We use: 


A ti = min 
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This criterion dose not contain the 2nd and 3rd time derivatives of the acceleration. 
To prevent the timestep derived by equation (20) from becoming too large, we set 
r] s to be the one-tenth of r/ for all simulation in this paper. 

We summarize all accuracy parameters in table 1. 


2.2 Implementation on GPUs 

Even with the Barnes-Hut tree algorithm, obtaining -F S oft,i is still costly and dom¬ 
inates the total calculation time [11]. To accelerate this part, we use GPUs, by 
modifying the sequoia library (Bedorf, Gaburov and Portegies Zwart, submitted 
to ComAC), on which the high-performance tree code for parallel GPUs Bonsai [7] 
is based. Our library calculates the long range forces on all particles, F so f tii by the 
Barnes-Hut tree algorithm (up to the quadrupole moment). On the other hand, we 
calculate -Fhard,i on the host computer. The library also returns, for each particle, 
the list of particles within the distance h from it. We use this list of neighbors to 
calculate -Fhard.i- The value of h should be sufficiently larger than r cut to guarantee 
that the particles which are not on the list of the neighbors of particle i do not enter 
the sphere of the radius r cu t around particle i during the time interval At so f t . 

We call the sphere with a radius of r cu t the neighbor sphere and the shell between 
the sphere with a radius of h and the neighbor sphere the buffer shell. The particles 
of which the nearest neighbor is outside the sphere with radius h are considered 
isolated and the particles on the list of neighbors are considered neighbor particles. 
We denote the width of the buffer shell as Arbuff (he. h = r cu t + Arbuff)- 
The compute procedures of our implementation of the P 3 T scheme on GPU is as 
follows: 

1 Evaluate long range forces on all particles -F so ft,i using GPU. 

2 Particles are divided into two groups; isolated and non-isolated, by using the 
neighbour list made on GPU. 

3 For non-isolated particles, Fhard.i are calculated on the host computer. 

4 All particles receive a velocity kick through F so f t ,i for Af so ft/2. 

5 Isolated particles are drifted by n n + AtsohVi. 

6 Non-isolated particles are integrated with the fourth-order Hermite scheme 
for At so ft. 

7 Evaluate -F S oft,i and make the neighbour list in the same way as in step 1-2. 

8 All particles obtain the velocity kick again for Af so f t /2. 

9 go back to step 3. 

3 Results 

3.1 Accuracy and Performance 

We performed a number of test calculations using the P 3 T scheme on GPUs, to 
study its accuracy and performance. In this section, we describe the result of these 
tests. For most of them we adopted a Plummer model [21] with 128K (hereafter 
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K=2 10 ) equal-mass particles as the initial condition. We use the so-called IV-body 
unit or Heggie unit, in which total mass M=l, the gravitational constant G=1 and 
total energy E = —1/4 [22]. To avoid the singularity of the gravitational potential, 
we use the Plummer softening and set e = 4/TV. Since this value is a typical separa¬ 
tion of a hard binary in the IV-body unit, we can follow the evolution of the system 
up to the moment of the core collapse. 

Note, in this paper, we use the energy errors as an indicator of the accuracy of the 
scheme. However, energy conservation dose not guarantee accuracy of simulations 
(though, it is necessary). Thus we will perform realistic simulations in section 3.2 
and check the statistical character of stellar systems by comparing the results with 
the Hermite scheme, which is widely used in collisional stellar system simulations. 
As we will see later, for simulations of the core collapse of the star cluster, when the 
relative energy error is < 10~ 3 at the moment of the core collapse, the behavior of 
the core collapse with the P 3 T scheme agreed with that with the Hermite scheme 
very well. 

3.1.1 Accuracy 

With the P 3 T scheme, we have six accuracy parameters. In sections 3.1.1.1-3.1.1.3, 
we discuss how each parameter controls the accuracy of the P 3 T scheme. In section 
3.1.1.4, we describe the accumulation of the energy error in a long-term integration. 
To measure energy errors accurately, we calculate potential energies by the direct 
summation instead of the tree code for all runs in this paper. 

3.1.1.1 Effect of r cu t, Af so f t and 6 In figure 2, we present the maximum relative 
energy error | AA max /Ao| over 10 A-body time units as a function of r cut and Af so ft 
for several different values of the opening criterion of the tree, 6. Here AA max is 
the maximum energy error and Eo is the initial energy. We chose = 0.1, Af max = 
Afsoft /4 and Arbuff = 3crAt so ft, where a is the global three dimensional velocity 
dispersion and we adopt a = 1/\[2. 

We can see that the error is smaller for smaller 9, smaller Af so ft, or larger 
r cu t- Roughly speaking, the error depends on two terms, At so f t /r cu t<7 and 9. If 
Af so ft/r cu t<7 is large, it determines the error. In this regime, the error is dominated 
by the truncation error of the leapfrog integrator. If it is small enough, 9 determines 
the error, in other words, the tree force error dominates the total error. Even for a 
very small value of 9 like 0.2, the tree force error dominates if Atgctt/fcut^ 1$ 0.05. 

In figure 3, we plot the maximum energy error as a function of 9. We use the 
same r), Af max and Arbuff as in figure 2. For the runs with r cu t = 1/256 and 
At so ft = 1/512, the energy error does not drop below 10 -6 because the error of 
the leapfrog integrator is larger than the tree force error. In an chaotic system 
like the model used in our simulations such energy error is sufficient to warrant a 
scientifically reliable result [23]. On the other hand, for the run with r cut = 1/128 
and Af so ft = 1/1024, integration error is smaller than the tree force error. 

3.1.1.2 Effect of Arbuff In figure 4, we show the maximum relative energy error 
as a function of Arbuff for the runs with Af max = Af so ft/4, 77 = 0.1, 9 = 0.2, for 
(Afsoft, Tout) = (1/512, 1/128) and (1/1024, 1/256). The energy error is almost 
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constant for Arbuff ^ 2Af so fter, which indicates that the energy error for Arb u ff < 
2 At so ft<r is caused by particles that are initially outside the buffer shell (with radius 
r cu t + Arbuff) and plunge into the neighbour sphere (with radius r cu t) during the 
timestep Af so f t . We can prevent this by adopting Arbuff > 2Ai so f t <r. 


3.1.1.3 Effect of At max and r] The maximum relative energy errors over 10 N- 
body time units are shown in the top panel of figure 5 as a function of r/ and the 
number of steps for the Hermite part (per particle per unit time, N step ) are presented 
in the bottom panel. The energy errors go down as r/ decrease until r) ~ 0.2. For 
77 < 0 . 2 , the errors hardly depend on Af max . 


3.1.1.4 Long term integration In figure 6 , we show the time evolution of the rela¬ 
tive energy error until T=500. We compare the accuracy of our P 3 T scheme with 
two other schemes, the direct fourth-order Hermite scheme and the leapfrog scheme 
with the Barnes-Hut tree code. The calculations with the direct Hermite scheme are 
performed by using the Sapporo library on GPU [24], and the calculations with the 
leapfrog scheme are performed by using the Bonsai library on GPU [7]. The energy 
error of the P 3 T scheme behaves like a random walk whereas that of the leapfrog 
and the Hermite schemes grow monotonically. In the right-hand panels of figure 6 , 
we show the same evolution of the error as in the left panels, but time is plotted with 
a logarithmic scale. This allows us to realize that the error growth of Hermite and 
tree schemes are linear, whereas the error in the P 3 T scheme grows as oc T 1 / 2 . This 
latter proportionality is caused by the short-term error of the P 3 T scheme, which 
is dominated by the randomly changing tree-force error. For long-term integration 
the P 3 T scheme conserves energy better than the Hermite or leapfrog schemes. 


3.1.2 Calculation cost 

In this section, we discuss the calculation cost of the P 3 T scheme and its dependence 
on the number of particles TV, required accuracy, and other parameters. 

We first construct a simple theoretical model of the dependence of the calculation 
cost on parameters of the integration scheme such as TV, Af so ft, 6 and r cu t in section 
3.1.2.1. In section 3.1.2.2 we derive the optimal set of parameters from the model 
and compare this model with the result of the numerical tests. We found that the 
calculation cost per unit time is proportional to TV 4 / 3 . 


3.1.2.1 Theoretical model The calculation cost for the force evaluations in P 3 T 
is split into the tree part and the Hermite part. For the tree part, the calcu¬ 
lation cost of evaluating forces for all particles per tree step is proportional to 
O(0~ 3 NlogN). Since we use constant timestep for the tree part, the calculation 
costs of the integration of particles per unit time for the tree part is proportional 
to O (0 _ 3 TVlogTV/Afsoft) ■ 
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For the Hermite part, since each particle has its own neighbour particles and 
timesteps, the number of interactions for all particles per unit timstep is given by 

N 

Ni „ t ,hard — E -^ngh,i-ZVstep,i (21) 

i 

N 

~ ^ 4 V3(r C ut + Ar buff ) 3 n l (A^)' 1 (22) 

i 

oc N 2 (r cut + Ar bu ff) 13 (( At)) ~ 1 , (23) 

Here N ng bi j is the number of the neighbour particles around particle i, N stePt i is 
the number of timesteps required to integrate particle i for one unit time, rii is the 
local density around particle i, (A ti) is the average timestep of particle i over one 
unit time and ((At)) is the average of (Atj) over all particles. Here we assume rii 
is constant within the radius of r cut + Ar bu ff around particle i. 

Next we express the ((At)) as a function of N and r cu t- To simplify the discussion, 
we define the timestep of the particle through the relative position and velocity 
from its nearest neighbour particle; ((At)) oc Tnn/vnn, where tnn and unn are the 
relative position and the velocity of the nearest neighbour particle. We can replace 
unn to the velocity dispersion a. Thus average timestep is given by 

((At)) oc tnn/i’nn ~ ^nn/ct. (24) 

To further simplify the derivation we assume that the number density of particles 
in the system is uniform. If r cu t is larger than the mean inter-particle distance (r) 
(i.e. if most particles have neighbour particles), the average timestep is roughly 
given by 

((At)) ~ min fj 7 ^ 1 V -1/3 , At max ^ , (25) 

where R is the typical size of the system. In this case, the average timestep depend 
only on N (dose not depend on r cut ). 

If r’cut is small compared to (r), most particles are isolated and most of the non¬ 
isolated particles have only one neighbour particle. In this case, ((At)) is given by 

((At)) - min At max j . (26) 

In figure 7 we show the number of steps per particle per unit time tV s tep for a 
plummer sphere as a function of N (top panel) and as a function of r cut (bottom 
panel). In the top panel, we can see that A step is roughly proportional to N 1 / 3 
for large N (i.e. (r) is small). On the other hand when N is small N s tep is almost 
constant because (r) is large [see equation (26)]. 

The bottom panel of figure 7 shows that all curves eventually approach to constant 
values for both of large and small r cut . For large r cu t, the timesteps of the non¬ 
isolated particles are determined by N, not by r cu t [see equation (25)], whereas 
for small values of r cu t the non-isolated particles have a timesteps Ai max . This is 
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because most neighbouring particles are in the buffer shell and not in the neighbour 
sphere. For runs with Ai so ft=1/2048, 1/1024 and 1/512, we can see bumps of A s t ep 
at fcut ~ 1/512 due to the dependence on r cut shown in equation (26). 

Using above discussions, the number of interactions for all particles per unit time 
of the Hermite part Xj nt ,h a rd and the tree part -/Vint,soft are given by 


N int 

,hard 
N int,soft 


^ f A 7 / 1 * 3 (r cut + Ar bu ff ) 3 
\ N 2 (r cut + Ar bu ff ) 3 

oc f ?~ 3 AlogA/ At so ft, 


(for r cut > (r)) 
(for r cut < (r)) 


(27) 

(28) 


3.1.2.2 Optimal set of accuracy parameters In this section, we derive the optimal 
values of r cut and Af so f t from the point of view of the balance of the calculation 
costs between the tree and the Hermite parts, in other words we express r cu t and 
At S oft as functions of A such that Wnt.hard/A r i n t,soft is independent of A. Following 
the discussion in section 3.1.1.1 and because the energy errors can be controlled 
through At so ft/r cu t, r cu t should be proportional to Ai so ft. From section 3. 1 . 1 . 2 , 
Ar bu ff should be also proportional Af so ft. 

The requirements are met for Ai nt ,hard oc N 7/3 (r cut + Ar bu ff ) 3 (or A 2 (r cut + 
Ar buf f) 3 ), At so ft oc A -1 / 3 and r cut oc A -1 / 4 and both Amt,hard and A in t,soft are 
proportional to A 4 / 3 (or A 5 / 4 ). Here we have neglected the log A dependence in 
the tree part. 

This is illustrated in figure 8 , where we plot Ai„t,hard for a plummer sphere as a 
function of A. Following above discussions, we use the A-dependent tree timestep: 
Atsoft = (1/256)(A/16K) -1 / 3 and Aj nt ,hard as well as A; n t, so ft are proportional to 
A 4 / 3 . 

In figures 9 and 10, we plot the wall-clock time of execution T ca i and the max¬ 
imum relative energy errors \/^E max /E q\ for the time integration for 10 A-body 
units against A. Top (bottom) panel in figure 9 shows the results of the runs with 
r cu t/At S oft=2 (top panel) and 4 (bottom panel). All runs in these figures are carried 
out on NVIDIA GeForce GTX680W GPU and Intel Core i7-3770K CPU. For each 
run, we use one CPU core and one GPU card. 

We also perform the simulations using the direct Hermite integrator with the same 
77 and the standard tree code with the same 9 and Af so ft- These calculations are 
performed with the Sapporo GPU library [24] and a standard tree code with the 
same 9 and Af so ft using the Bonsai GPU library [7]. The calculation time for our 
P 3 T implementation is also proportional to A 4 / 3 , as we presented in section 3.1.2.1, 
while for the Hermite integrator it is proportional to A 7 / 3 . The P 3 T scheme is faster 
than the direct Hermite integrator for A > 16K and when A=1M (M=2 20 ), the 
P 3 T scheme is about 50 times faster than the direct Hermite scheme. The pure tree 

[1) GTX680 does not have EGG (Error Check and Correct) memories. However, as 
we will see later, we do not observe any large energy error in all of our runs, which 

means the hardware error dose not affect our result. Betz, DeBardeleben and Walker 

[25] performed Molecular Dynamics simulations, in order to investigate the rate 

of bit-flip error events. They observed a single bit-flip error event in about 4700 

GPU*hours without EGG and conclude that the bit-flip error is exceedingly rare. 
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code is slightly faster than the P 3 T scheme, but the integration errors are worse by 
several orders of magnitude (see figure 6 and 10). 

3.2 Examples of practical applications 

In sections 3.1.1 and 3.1.2, we presented a detailed discussion on the accuracy and 
performance of our P 3 T scheme. However, we performed simple simulations, where 
the stellar systems are in the dynamical equilibrium. In this section, we study the 
performance of our P 3 T scheme when applied to more realistic, or more difficult, 
simulations by comparing the results of the Hermite scheme. In section 3.2.1, we 
discuss the case of the simulation of star clusters up to core collapse. In section 
3.2.2, we discuss the case of a galaxy model with massive central black hole binary. 

3.2.1 Star cluster down to core collapse 

In this section, we discuss the performance of our P 3 T scheme for the simulation 
of the core collapse of a star cluster. In section 3.2.1.1, we describe the initial 
condition and parameters of the integration scheme. In section 3.2.1.2 we compare 
the calculation results obtained by the P 3 T and Hermite schemes, and in section 
3.2.1.3 the calculation speed. 

3.2.1.1 Initial conditions We apply the P 3 T scheme to the evolution of a star 
cluster consisting of 16K stars to the moment of the core collapse [26]. We use an 
equal-mass plummer model as an initial density profile and we adopt p = 0.1. We 
apply the Plummer softening e = 4/N = 1/4096. The simulations are terminated 
when the core number-density exceeds 10 , at which point the mean interparticle 
distance in the core is comparable to e. Next, we set 9. We must chose 9 so that the 
tree force error is smaller than the force due to the two-body relaxation. Hernquist 
et al. [27] pointed out that, for 9 = 0.5 with monopole and quadrupole, the tree- 
force error is much smaller than the force due to the two-body relaxation. Thus 
we chose 9 = 0.4 with quadrupole as a standard model. For comparison, we also 
perfrome a run with 9 = 0.8. 

To resolve the motions of the particles in the core, we impose Af so f t to be smaller 
than 1/128 of the dynamical time of the core (~ \/37r/16p C ore, where p CO re is the core 

— 1/3 

density). To reduce the calculation cost for the Hermite part we require r cu t oc /tWe 
and set the initial value of r cu t = 1/64. We also change Arbuff = 3<T C oreAt so ft, where 
Ucore is the velocity dispersion in the core, and At max = Af so f t /4, as Ai so f t and 
trcore are changing. Here, to calculate p co re and <7 core , we use the formula proposed 
by Casertano and Hut [28] . The same simulation is repeated using the fourth-order 
Hermite scheme with the block timesteps with the same value of p = 0.1. 

3.2.1.2 Results In figure 11 we present the evolution of the core densities p cole (top 
panel) and the core radii r cole (bottom panel) for P 3 T and Hermite schemes. For 
each scheme, we perform three runs, changing the initial random seed for generating 
the initial conditions of the Plummer model. The behaviors of the cores for all 
runs are similar. The differences between two schemes are smaller than run-to-run 
variations. 

Figure 12 shows the relative energy errors of the runs with the same initial seed as 
functions of the core density (top panel) and the time (bottom panel). The energy 
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errors of the runs with P 3 T scheme change randomly, whereas those of the Hermite 
code grow monotonically. As a result, the P 3 T scheme with 6 = 0.4 conserves energy 
better than the Hermite scheme in the long run. The errors for the P 3 T scheme with 
6 = 0.8 is slightly worse than that of the Hermite scheme, but the behavior of the 
core are similar with other runs. Thus the choice of 9 = 0.4 is enough to follow the 
core collapse simulations. 

3.2.1.3 Calculation speed Figure 13 shows the calculation time of the P 3 T scheme 
(9 = 0.4) and Hermite scheme on GPU. As shown in this figure, the calculation time 
of the P 3 T scheme is dominated by the tree (soft) part calculation. 

Initially the P 3 T scheme is much faster than the Hermite scheme, but after the 
time when p c ore ~ 10 4 , the P 3 T scheme is slightly slower than the Hermite scheme 
because in the P 3 T scheme, Ai so f t is proportional to p c ore 2 - However, even for the 
P 3 T scheme, the CPU time spent after p c ore reaches 10 4 is small. As a result, the 
calculation time to the moment of the core collapse of the P 3 T scheme is smaller 
than that of the Hermite scheme by a factor of two. 

3.2.2 Orbital evolution of SMBH binary 

In this section, we also discuss the performance of the P 3 T scheme applied to 
simulations of a galaxy with a supermassive black hole (SMBH) binary. In section 
3.2.2.1, we describe the initial conditions and parameters of the integration scheme. 
In section 3.2.2.2 we compare the calculation results obtained by the P 3 T and 
Hermite schemes, and in section 3.2.2.3 the calculation speed. 

3.2.2.1 Initial conditions and methods We use the Plummer model with A^=16K, 
128K and 256K as the initial galaxy model. Two SMBH particles with a mass 
of 1 % of that of the galaxy are placed at the positions (± 0.5, 0.0, 0.0) with 
the velocities (0.0, ± 0.5, 0.0). We use three values for the cut off radius with 
respect to three different kinds of interactions. For the interaction between field stars 
(FSs), we set r cut ,FS-FS = 1/256. For the interaction between SMBHs, the force 
is not split and F so ft = 0. In other words, the force between SMBHs is integrated 
with the pure Hermite scheme. We set the cut off radius between SMBH and FS 
fcut.BH— fs = 1/32 which is large enough that At so f t is smaller than the Kepler 
time of a particle in orbit around the SMBH binary at a distance of r cutl BH-FS- 
We use the Plummer softening e = 10 -4 for the interactions between FS-FS and 
FS-SMBH. For the SMBH-SMBH interaction, we do not use the softening. The 
accuracy parameter of timestep criterion for FS ?7 fs is 0.1, and for SMBH ?;bh is 
0.03. We adopt Ar bu ff = 3crAf soft , At max = Af soft /4 and 0 = 0.4. 

We use Af so ft = 1/1024 at T = 0, and as the binary becomes harder, we decrease 
A£ so ft to suppress the aliasing error of the binary. As a standard model, we set 
Afsoft to be less than half of the Kepler time of the SMBH binary tkep- Only for 
N =128K, we also perform two other runs, where Ai so ft < tkep/4 and tkep- 

We also perforin the same simulations by the Hermite scheme with the same ?;fs 
and ?7 bh- 
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3.2.2.2 Results Figure 14 shows the evolution of the semi-major axis (top panel) 
and eccentricity (middle panel) of the SMBH binary and the relative energy error 
(bottom panel) as functions of time for our standard models (At so ft < tke P /2). 
The behaviors of the semi-major axis of the SMBH binary for the runs with the 
same N agree very well. The hardening rate of the binary depends on N because 
of the loss-cone refilling through the two-body relaxation [29-31]. The evolution 
of the eccentricity has large variation, because this evolution is sensitive to small 
N fluctuation [32], In the cases of iV=16K with the Hermite scheme, the relative 
energy error increases dramatically after T = 150 because the binding energy and 
the eccentricity of the binary are very high. 

Figure 15 is the same as figure 14 but for several different values of At so ft- Thick 
solid, dashed and dotted curves indicate the results for At so ft < tke P /4, tk ep /2 and 
tke P , respectively. The orbital parameters show similar behaviors for all runs. The 
absolute value of the energy errors of P 3 T runs (~ 10~ 5 ) are small compared with 
the binding energy of SMBH binary, which is roughly 0.05. 

3.2.2.3 Calculation speed Figure 16 shows the calculation time for runs for several 
different values of N with At so f t < tke P /2. Initially, the P 3 T scheme is much faster 
than the Hermite scheme. As the SMBH binary becomes harder, the P 3 T scheme 
slows down more significantly than the direct Hermite scheme does. We can see 
that T ca i of the Hermite scheme is roughly proportional to a -1 for a -1 > 300, 
whereas that of the P 3 T scheme is roughly proportional to a~ 5 ^ 2 , because At so ft 
is proportional to the Kepler time of the binary (oc a 3 / 2 ). However, the calculation 
time for all runs with the P 3 T scheme is shorter than that with the Hermite scheme 
by a = 1 /800. We can also confirm that as we use more N, the ratio of the calculation 
time of the P 3 T scheme to the Hermite scheme become larger. The reason why the 
P 3 T scheme becomes slower for large a -1 is simply that we force the timestep of all 
particles to be smaller than the orbital period of the SMBH binary. For the Hermite 
scheme, we do not put such constraint. Thus, in the Hermite scheme, particles far 
away from the SMBH have the timestep much larger than the orbital period of 
the SMBH binary. This large timestep can cause accuracy problem [33]. With P 3 T, 
it is possible to apply the perturbation approximation to F so f t between the SMBH 
binary and other particles. Such a treatment should improve the accuracy and speed 
of the P 3 T scheme when SMBH binary becomes very hard. 

In figure 17, we plot the calculation time of the hard and soft parts for the standard 
model with iV=128k. We can see that the soft parts dominate the calculation time. 

In figure 18, we compare the calculation time for the runs with various At so ft ( 
< tke P , tke P /2, fke P /4). Since the most of the calculation time is spent after the 
binary becomes hard, the calculation time strongly depends on the criterion of the 
Atsoft- From figure 15, the evolution of the orbital parameters for all runs with 
the P 3 T scheme are similar for various At so ft criterion. Thus we could chose larger 
At S oft > tk ep after the binary formation. 

4 Conclusions 

We described the implementation and performance of the P 3 T scheme for simulating 
dense stellar systems. In our implementation, the tree part is accelerated using GPU. 
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The accuracy and performance of the P 3 T scheme can be controlled through six 
parameters: Ar cut , Ar buff , At soft , AT max , ij and 9. We find that Ar buff > 2aAt aoh is 
good choice to prevent non-neighbour particles from entering the neighbour sphere. 
The integration errors can be controlled through At so f t /Ar cut cr. For 9 = 0.2, if we 
set At so ft to be less than 0.05A7" cu t/<T, the integration error is smaller than the tree 
force error. For the Hermite part, if we chose 77 < 0.2, the errors hardly depend on 
Af max . 

From the point of view of the balance of the calculation costs between the tree 
and Hermite parts, we derive the optimal set of accuracy parameters, and found 
that the calculation cost is proportional to TV 4 / 3 . 

The P 3 T scheme is suitable for simulating large TV stellar clusters with a high den¬ 
sity contrast, such as star clusters or galactic nuclei. We demonstrate the efficiency 
of the code and show that it is able to integrate TV-body systems to the moment of 
the core collapse. We also performed the simulations of the galaxy with the SMBH 
binary and found that the P 3 T scheme can be applied to these simulations. 

Finally, we discuss the possibilities of implementation of two important effects on 
star cluster evolution to P 3 T. The first is an effect of a tidal field which dramatically 
change the collapse time and the evaporation time of a star cluster. The tidal field 
effect can be included in the soft part. 

The other is an effect of the stellar-mass binary. A stellar-mass binary plays an 
important role in halting the core collapse. In this paper, we introduce the Plummer 
softening and neglect these binary effect. However, we could treat these effects by 
integrating stellar-mass binaries in the hard part. 

Our P 3 T code is incorporated in the AMUSE frameworks and free for use [34, 35]. 
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Figure 1 The cutoff function K(y) (top) and the forces (bottom) as functions of 

y —— &ij / T cut • 
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Figure 4 Maximum relative energy error as a function of Arbuff in unit of At so ftcr- Here a is 
the global three dimensional velocity dispersion of the system (= l/y/2). For all runs, we use 
V = 0.1, At so ft = 1/512, tmax = At S oft/4, 6 = 0.1 and Arbuff = 3<rA£ so ft- 
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Figure 5 Maximum relative energy error and the steps for the Hermite part against 77 . Top and 

bottom panels show the maximum relative energy error and the steps for the Hermite part par 
particle par unit time against rj, respectively. 




Figure 6 Evolution of relative energy errors with various schemes. We use A£ ma x = A£ so ft/4, 

Arbuff — 3<rAt S oft for all runs and 6 = 0.4 for the tree code and rj = 0.1 for the Hermite scheme. 
In left and right panels, the x-axes are linear and logarithmic scales, respectively. Thin curves in 
right panels are proportional to T (solid) and T 1 / 2 (dashed). 
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Figure 9 Wall-clock time of execution as a function of AT.Top (bottom) panel shows the results 
of the runs with r cu t/Af so ft = 2(4). We use 6 = 0.4, 77 = 0.1, Arbuff — 3<rAt so ft and 
At soft = (l/256)(iV/16K)- 1 / 3 . 



Figure 10 Maximum relative energy errors over 10 TV-body time units. All runs are the same as 
those in figure 9. 
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Figure 11 Time evolution of the core density (top) and the core radius (bottom). Thick and 
thin curves show the results of the P 3 T and Hermite scheme, respectively. The curves for different 
runs are vertically shifted by a factor of 8 (top) and 2 (bottom). 
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Figure 12 Relative energy error as functions of pcore (top) and time (bottom). 
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Figure 13 Wall-clock time of execution as functions of pcore. In top and bottom panels, the 
y-axes are logarithmic and linear scales, respectively. 
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Figure 15 Evolution of semi-major axis (top), eccentricity (middle) of the SMBH binary and 
energy error (bottom) for the several different valuse of At so ft- Thick solid, dashed and dotted 
curves show the results of P 3 T scheme with A£ so f t is less than tkep/4, ^kep/2 and tkep. 
respectively. 
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Figure 16 Wall-clock times as a function of 1/a (top and middle) and the system time of 
the simulations (bottom) for several different values of N. In top and middle panels, the x- and 
y-axis are logarithmic and linear scales, respectively. In bottom panel, x-axis is scaled by 7V/16K. 
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Figure 17 Wall-clock times as a function of 1/a. In top and bottom panels, the x- and y-axis 
are logarithmic and linear scales, respectively. 





















Iwasawa et al. 


Page 26 of 27 



1 10 100 1000 


1/a 



Figure 18 Wall-clock time as a function of 1/a for several different values of At so ft. In top 
and bottom panels, the x- and y-axis are logarithmic and linear scales, respectively. 
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Tables 

Table 1 Symbols and definitions for the accuracy parameters of the P 3 T scheme 
a timestep softening. For all runs, a = 0.1. 

7 ratio of inner and outer cutoff radius (^i n /r cu t)- For all runs, 7 = 0.1. 

A^buff width of the buffer shell. Arbuff — 3crAt so ft, as a standard value. 

Aigoft timestep of the soft part. At so f t = (1/256) (77/16K) — - 1 / 3 , as a standard value. 

At m ax maximum timestep of the hard part. At ma x = At so f t /4, as a standard value, 
e plummer softening length, e = (4/77), as a standard value. 

77 accuracy parameter for timestep criterion. 77 = 0 . 1 , as a standard value. 

t* cut outer cutoff radius of smooth transition functions W and K. r cu t = 4At so f t , as a standard value. 

rj n inner cutoff radius of smooth transition functions W and K (rj n = 7r cu t)- 

6 opening criterion for tree. 6 = 0.4, as a standard value. 




